# Lecture 13 Cache Memories

CS213 – Intro to Computer Systems Branden Ghena – Winter 2022

Slides adapted from:

St-Amour, Hardavellas, Bustamante (Northwestern), Bryant, O'Hallaron (CMU), Garcia, Weaver (UC Berkeley)

## **Announcements**

- Homework 3 due today

- Attack Lab due next week
  - Get started ASAP!

- Be sure to check out the Campuswire post on "Attack Lab Even Return Requirement"
- PM Huaxuan Chen did an overview of the Attack Lab
  - Slides posted to Campuswire, and recording on Canvas->Panopto

# Today's Goals

- Discuss organization of various cache designs
  - Direct-mapped caches
  - N-way set-associative caches
  - Fully-associative caches

- Understand how cache memories are used to reduce the average time to access memory

## **Outline**

Cache Organization

Associativity

Cache Performance

## Caching speeds up code

- Cache: smaller, faster storage device that keeps copies of a subset of the data in a larger, slower device
  - If the data we access is already in the cache, we win!
  - Can get access time of faster memory, with overall capacity of larger
- Locality helps predict which data code is likely to access
  - So want to design caches to take advantage of it!
    - Most code has good locality
    - Well-written code has great locality!
  - Spatial locality: if you need a byte, you're likely to need its neighbors
    - Caches should load whole blocks, not single bytes!
  - Temporal locality: if you need a byte, you're likely to need it again
    - Caches should try to keep recently cached data in the cache!
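To make spatial locality concrete, here is a rough sketch that counts how often consecutive accesses land in a different block. It is only a proxy for cache behavior (it acts like a cache that remembers a single block), and the array shape, 4-byte element size, 64-byte block size, and the name `block_changes` are all assumptions chosen for illustration.

```python
def block_changes(addresses, block_size=64):
    """Count how many accesses fall in a different block than the previous access."""
    changes = 0
    previous_block = None
    for addr in addresses:
        block = addr // block_size
        if block != previous_block:
            changes += 1
            previous_block = block
    return changes


ROWS, COLS, ELEM_SIZE = 64, 64, 4  # hypothetical 64x64 array of 4-byte ints, stored row by row

# Row-major traversal walks neighboring bytes; column-major jumps a full row each step.
row_major = [(r * COLS + c) * ELEM_SIZE for r in range(ROWS) for c in range(COLS)]
col_major = [(r * COLS + c) * ELEM_SIZE for c in range(COLS) for r in range(ROWS)]

print(block_changes(row_major))  # 256: one new block every 16 elements
print(block_changes(col_major))  # 4096: every access touches a different block
```

Both loops touch the same 4096 elements; only the order differs, which is the point of "well-written code has great locality".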

## Cache memories

- A specific instance of the general principle of caching
  - Small, fast SRAM-based memories between CPU and main memory
  - Can include multiple levels
    - L1 = small, but really fast, L2 = larger, slower, L3, etc.
- CPU looks for data in caches first
  - e.g., L1, then L2, then L3, then finally in main memory as a last resort
- Mechanisms we'll see today are implemented in hardware
- Typical system structure:



# How You Probably Thought a Memory Access Worked



## How a Memory Access Actually Works



# General Cache Organization (S, A, B)



## Cache Access



## Cache Read (1): Locate Set

Locate set



Each address maps to a particular set!

Data has to be stored at that particular set!

Even if that set is full and there "is space" elsewhere!

(That's where conflict misses come from.)

## Cache Read (2): Tag Match + Valid

Locate set

Locate block in set

- Tag matches + valid bit set → Cache Hit!

[Figure: the address of the word is split into tag, set index, and block offset (b bits); the cache has $K = 2^{s}$ sets with **A** lines per set, each line holding a valid bit, a tag, and a block. The example values 0x1E45 and 0xFF appear in both the address and the cache.]

Within a set, could be anywhere! So, need to check all lines!

But if it's not in that set, it's not in the cache at all! (It's the only place it could be.)

# Cache Read (3): Block Offset



# Example: 128 sets, 64 bytes per block
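A minimal sketch of how an address would be split for this configuration (128 sets → 7 set-index bits, 64-byte blocks → 6 offset bits). The function name `split_address` and the sample address `0x12345` are illustrative assumptions, not taken from the slides; real hardware does this with wires, not code.

```python
def split_address(addr, num_sets=128, block_size=64):
    """Split an address into (tag, set index, block offset).

    Assumes num_sets and block_size are powers of two.
    """
    offset_bits = block_size.bit_length() - 1   # 64  -> 6 bits
    set_bits = num_sets.bit_length() - 1        # 128 -> 7 bits
    offset = addr & (block_size - 1)            # low bits select the byte in the block
    set_index = (addr >> offset_bits) & (num_sets - 1)  # middle bits select the set
    tag = addr >> (offset_bits + set_bits)      # everything else is the tag
    return tag, set_index, offset


tag, set_index, offset = split_address(0x12345)
print(tag, set_index, offset)  # 9 13 5
```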



## Cache access overview



## What about writes?

- Multiple copies of data exist:
  - L1, L2, Main Memory, Disk
  - Don't want them to get (or at least not to stay) out of sync!
    - Otherwise, who do you believe?

Multiple configuration options that a cache could have

## Write configurations

- What to do on a write-hit?
  - *Write-through* (write immediately to memory)
  - Write-back (delay write until we evict this cache block)
    - Need a dirty bit (indicate if line differs from memory)
    - We had an example of that last time
- What to do on a write-miss?
  - Write-allocate (load into cache, update line in cache)
    - Good if more writes to the location follow
  - No-write-allocate (writes immediately to memory, doesn't bring into cache)
- Typical
  - Write-back + Write-allocate ← by far the most common
  - Write-through + No-write-allocate
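A small sketch of why the write-hit policy matters: it counts how many writes reach memory when a program stores repeatedly to one block that is already cached, and the block is then evicted. The function name `memory_writes` and the single-block setup are assumptions made for illustration.

```python
def memory_writes(num_stores, policy):
    """Count writes that reach memory for num_stores write-hits to one
    cached block, followed by eviction of that block."""
    dirty = False
    writes_to_memory = 0
    for _ in range(num_stores):
        if policy == "write-through":
            writes_to_memory += 1      # every store goes straight to memory
        elif policy == "write-back":
            dirty = True               # only mark the line; memory is now stale
        else:
            raise ValueError(f"unknown policy: {policy}")
    if dirty:
        writes_to_memory += 1          # eviction: write the dirty block back once
    return writes_to_memory


print(memory_writes(10, "write-through"))  # 10
print(memory_writes(10, "write-back"))     # 1
```

With temporal locality (many stores to the same block), write-back turns many memory writes into one, which is consistent with it being the common choice.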

- 64-bit, byte-addressed system
- 32 kB cache
  - 512 sets and 64-byte blocks
- How many bits for Tag?
  - A: 6 bits
  - B: 9 bits
  - C: 17 bits
  - D: 49 bits

#### Address of word:




- How many bits for Tag? (6 bits for block offset, 9 bits for set index)
  - A: 6 bits
  - B: 9 bits
  - C: 17 bits
  - **D: 49 bits** (Tag is the remaining bits: 64 − 6 − 9 = 49)
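The same arithmetic as a short sketch, assuming the number of sets and the block size are powers of two. The name `field_widths` is made up for this example.

```python
def field_widths(address_bits, num_sets, block_size):
    """Return (tag bits, set-index bits, block-offset bits)."""
    assert num_sets & (num_sets - 1) == 0, "num_sets must be a power of two"
    assert block_size & (block_size - 1) == 0, "block_size must be a power of two"
    offset_bits = block_size.bit_length() - 1   # log2(block size)
    set_bits = num_sets.bit_length() - 1        # log2(number of sets)
    tag_bits = address_bits - set_bits - offset_bits  # whatever is left over
    return tag_bits, set_bits, offset_bits


print(field_widths(64, 512, 64))  # (49, 9, 6)
```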


## Cache memory associativity

- When designing a cache, a number of parameters to choose
  - Total size (C), cache block size (B), number of sets (K), ...
- The most interesting one: associativity (A)
  - i.e., how many cache blocks per set
  - Has a significant impact on effectiveness (and complexity!)

## Associativity choices

- Associativity 1 → direct-mapped caches
  - One cache block per set, blocks can only go in that one block
  - Whenever we place data in a set, must evict whatever is there
- Associativity  $>1 \rightarrow$  **set-associative caches** 
  - Can keep multiple cache blocks that would map to the same set
- Single set → fully-associative caches
  - Any cache block can go anywhere, 1 big set, tag is all that matters
  - Very rare for cache memories due to expensive hardware

Direct mapped: one line per set. Assume: cache block size 8 bytes.




If tag doesn't match or valid bit is not set: cache miss!

→ old line is evicted and replaced with currently requested one

## Direct-mapped cache simulation

M=16 addresses, byte-addressable

B=2 bytes/block

K=4 sets

A=1 blocks/set

| t=1 | s=2 | b=1 |
|-----|-----|-----|
| X   | XX  | X   |

Address trace (reads, one byte per read):

| Address | [tag set offset] | Result |
|---------|------------------|--------|
| 0 | [0 00 0<sub>2</sub>] | miss |
| 1 | [0 00 1<sub>2</sub>] | hit  |
| 7 | [0 **11** 1<sub>2</sub>] | miss |
| 8 | [1 00 0<sub>2</sub>] | miss |
| 0 | [0 00 0<sub>2</sub>] | miss |
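The trace above can be replayed with a few lines of code. This is a sketch of the lookup logic only (tags, no data), and the function name `simulate_direct_mapped` is an assumption for illustration.

```python
def simulate_direct_mapped(trace, num_sets=4, block_size=2):
    """Replay byte reads through a direct-mapped cache; return 'hit'/'miss' per access."""
    lines = [None] * num_sets            # one tag per set; None means valid bit = 0
    results = []
    for addr in trace:
        block = addr // block_size       # drop the block-offset bits
        set_index = block % num_sets     # low bits of the block number pick the set
        tag = block // num_sets          # remaining high bits are the tag
        if lines[set_index] == tag:
            results.append("hit")
        else:
            results.append("miss")
            lines[set_index] = tag       # evict whatever was there, load the new block
    return results


print(simulate_direct_mapped([0, 1, 7, 8, 0]))
# ['miss', 'hit', 'miss', 'miss', 'miss']
```

Addresses 0 and 8 both map to set 00, so loading 8 evicts the block holding 0, and the final read of 0 misses even though three of the four sets were never contended.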

## What are the types of each miss here?

M=16 addresses, byte-addressable

B=2 bytes/block

K=4 sets

A=1 blocks/set

| t=1 | s=2 | b=1 |
|-----|-----|-----|
| X   | XX  | X   |

Address trace (reads, one byte per read):

| Address | [tag set offset] | Result | Miss type |
|---------|------------------|--------|-----------|
| 0 | [0 00 0<sub>2</sub>] | miss | Compulsory |
| 1 | [0 00 1<sub>2</sub>] | hit  |            |
| 7 | [0 **11** 1<sub>2</sub>] | miss | Compulsory |
| 8 | [1 00 0<sub>2</sub>] | miss | Compulsory |
| 0 | [0 00 0<sub>2</sub>] | miss | Conflict   |

Cache contents:

|                     | V | tag | block     |
|---------------------|---|-----|-----------|
| set 00<sub>2</sub>  | 1 | 0   | m[1] m[0] |
| set 01<sub>2</sub>  | 0 |     |           |
| set 10<sub>2</sub>  | 0 |     |           |
| set 11<sub>2</sub>  | 1 | 0   | m[7] m[6] |

#### Options:

- Compulsory
- Capacity
- Conflict

#### **Conflict misses**:

There is "room" in the cache, but two blocks map to the same set; one evicts the other!

Pause for questions on direct-mapped caches


# 2-way set-associative cache (associativity = 2)

A = 2: Two lines per set

Assume: cache block size 8 bytes

[Figure: the address of a short int is split into tag, set index, and block offset; the selected set holds two lines, each with a valid bit (v), a tag, and bytes 7 down to 0.]

- Compare both lines in the set: valid? + tag match? If yes → hit
- The block offset then locates the data within the block: the short int is here (2 bytes)

The data we want is either on the left, or on the right, or not in the cache at all. It can't be anywhere else! Addresses map to a single set!

#### If no match:

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...
  - More clever → lower miss rate, but harder to implement in hardware

## 2-way set-associative cache simulation

M=16 addresses, byte-addressable, B=2 bytes/block, K=2 sets, A=2 blocks/set

Same total size and block size as before. Associativity (and thus # of sets) changed.

| t=2 | s=1 | b=1 |
|-----|-----|-----|
| XX  | X   | X   |

Address trace (reads, one byte per read), next to what the same address sequence resulted in for the direct-mapped cache:

| Address | [tag set offset] | 2-way result | Direct-mapped result |
|---------|------------------|--------------|----------------------|
| 0 | [00 0 0<sub>2</sub>] | miss | miss |
| 1 | [00 0 1<sub>2</sub>] | hit  | hit  |
| 7 | [01 1 1<sub>2</sub>] | miss | miss |
| 8 | [10 0 0<sub>2</sub>] | miss | miss |
| 0 | [00 0 0<sub>2</sub>] | hit  | miss |

Higher associativity = Less likely to have to evict!

Temporal locality: want data in cache to stay in cache!

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 00  | M[1-0] |
|       | 1 | 10  | M[9-8] |
| Set 1 | 1 | 01  | M[7-6] |
|       | 0 |     |        |
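A sketch that replays the same trace through a set-associative cache. It assumes LRU replacement (the slides list LRU as one option among several), and the name `simulate_set_associative` is made up for this example.

```python
def simulate_set_associative(trace, num_sets=2, ways=2, block_size=2):
    """Replay byte reads through a set-associative cache with LRU replacement
    (an assumption); return 'hit'/'miss' per access."""
    # Each set is a list of tags, ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    results = []
    for addr in trace:
        block = addr // block_size
        set_index = block % num_sets
        tag = block // num_sets
        lines = sets[set_index]
        if tag in lines:                 # must compare against every line in the set
            results.append("hit")
            lines.remove(tag)
        else:
            results.append("miss")
            if len(lines) == ways:
                lines.pop(0)             # set is full: evict the least recently used line
        lines.append(tag)                # this block is now the most recently used
    return results


trace = [0, 1, 7, 8, 0]
print(simulate_set_associative(trace, num_sets=2, ways=2))
# ['miss', 'hit', 'miss', 'miss', 'hit']
print(simulate_set_associative(trace, num_sets=4, ways=1))
# ['miss', 'hit', 'miss', 'miss', 'miss']
```

Setting `ways=1` with four sets gives the direct-mapped behavior, so the two configurations can be compared on the same total capacity.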

Pause for questions on set-associative caches

## Fully-associative caches

- What changes with fully-associative caches?
  - Anything can go anywhere
  - Only one set (s = 0 bits)

- Otherwise, same steps as for a set-associative cache
  - Compare tag against all blocks in the set

- Fully-associative cache on a 16-bit system
  - One set (fully associative!)
  - Eight, 64-byte blocks
  - Tags currently in the cache: 0x000, 0x1FF, 0x010, 0x011, 0x050, 0x051, 0x052, 0x300

- Are the following addresses in the cache? (64-byte blocks → 6 offset bits, so the tag is the upper 10 bits)
  - 0x0400 ⇒ 0b<u>0000 0100 00</u>00 0000 → Tag 0x010 **HIT**
  - 0x0410 ⇒ 0b<u>0000 0100 00</u>01 0000 → Tag 0x010 (same block!) **HIT**
  - 0xC002 ⇒ 0b<u>1100 0000 00</u>00 0010 → Tag 0x300 **HIT**
  - 0xC048 ⇒ 0b<u>1100 0000 01</u>00 1000 → Tag 0x301 (different block!) **MISS**
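The lookups above can be checked with a short sketch. With only one set there are no set-index bits, so the tag is simply the address with the 6 offset bits dropped. The set of tags is copied from the slide; the name `lookup` is an assumption for illustration.

```python
BLOCK_SIZE = 64  # bytes per block, so 6 block-offset bits
CACHED_TAGS = {0x000, 0x1FF, 0x010, 0x011, 0x050, 0x051, 0x052, 0x300}


def lookup(addr):
    """Return (tag, hit?) for a fully-associative cache holding CACHED_TAGS."""
    tag = addr // BLOCK_SIZE          # no set index: everything above the offset is tag
    return tag, tag in CACHED_TAGS    # hardware compares against all lines at once


for addr in (0x0400, 0x0410, 0xC002, 0xC048):
    tag, hit = lookup(addr)
    print(f"{addr:#06x} -> tag {tag:#05x} {'HIT' if hit else 'MISS'}")
```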

## **Associativity Pros and Cons**

#### Direct-mapped

- Simplest to implement: look-up compares tag with 1 cache line
  - → requires fewer transistors, which can be used elsewhere on the chip
- Conflicts can easily lead to thrashing
  - Two cache lines map to the same set, program needs both, and they keep kicking each other out of the cache. Lots of misses. Bad times.

#### Set-associative

- More complex implementation: requires more (HW) tag comparators
- Lower miss rate than direct-mapped caches (fewer conflict misses)
  - 2-way is a significant improvement over direct-mapped
  - 4-way is a more modest improvement over 2-way, and so on

#### Fully-associative

- One comparator per cache line in the cache. Ouch.
  - Often a deal-breaker for hardware
- Very low miss rate!

## Intel Core i7 Cache Hierarchy

[Figure: processor package with Core 0 through Core 3; each core has registers, an L1 d-cache, an L1 i-cache, and a unified L2 cache; a unified L3 cache is shared by all cores and sits in front of main memory.]

| Cache | Size | Associativity | Access time |
|-------|------|---------------|-------------|
| L1 i-cache and d-cache | 32 KB | 8-way | 4 cycles |
| L2 unified cache | 256 KB | 8-way | 11 cycles |
| L3 unified cache (shared by all cores) | 8 MB | 16-way | 30-40 cycles |

- L1: Keep separate caches for instructions and data. Don't want them to step on each other's toes!
- L3: Last resort before going to main memory (slow!) So want this large and highly-associative, to have very few misses.
- Block size: 64 bytes for all caches.


## Cache Performance Metrics

#### Miss Rate

- Fraction of memory references not found in cache (misses / accesses) = 1 − hit rate
- Typical numbers (in percentages):
  - 3-10% for L1
  - Can be quite small (e.g., < 1%) for L2, depending on dataset size, etc.
  - However, many applications have >30% miss rate in L2 cache

#### Hit Time

- Time to deliver a line in the cache to the processor
  - Includes time to determine whether the line is in the cache
- Typical numbers:
  - 1-2 clock cycles for L1
  - 5-20 clock cycles for L2

#### Miss Penalty

- Additional time required because of a miss
- Typically 50-200 cycles for main memory
  - Not really a "penalty", just how long it takes to read from memory

## Let's think about those numbers

- Huge difference between a hit and a miss
  - Could be 100x, if comparing L1 and main memory
- Would you believe a 99% hit rate is twice as good as 97%?
  - Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
  - Average access time:

```
97% hits: 100 instructions: 100 cycles (1 per instruction) + 3*100 (misses)
          on average: 1 cycle/instr. + 0.03 * 100 cycles/instr. = 4 cycles/instr.
99% hits: on average: 1 cycle/instr. + 0.01 * 100 cycles/instr. = 2 cycles/instr.
```

- This is why "miss rate" is used instead of "hit rate"
  - In our example, 1% miss rate vs. 3% miss rate
  - Makes the radical performance difference more obvious
- "Computation is what happens between cache misses."

## Average Memory Access Time (AMAT)

- AMAT = Hit time + Miss rate  $\times$  Miss penalty
  - Generalization of previous formula
- Can extend for multiple layers of caching
  - AMAT = Hit Time L1 + Miss Rate L1 × Miss Penalty L1
    - Miss Penalty L1 = Hit Time L2 + Miss Rate L2 × Miss Penalty L2
    - Miss Penalty L2 = Hit Time Main Memory

Multi-level caching helps minimize AMAT
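The recursive definition above can be sketched as a small function that works outward-in from main memory. The function name `amat` and the two-level miss rates in the second example are hypothetical; only the 1-cycle / 100-cycle numbers come from the earlier slide.

```python
def amat(levels, memory_time):
    """Average memory access time in cycles.

    levels: list of (hit_time, miss_rate) pairs, ordered from L1 outward.
    memory_time: hit time of main memory (the miss penalty of the last cache).
    """
    penalty = memory_time
    # Start at the last cache level and fold inward toward L1.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


# Single level, numbers from the earlier slide: 1-cycle hit, 100-cycle miss penalty.
print(amat([(1, 0.03)], 100))  # 4 cycles with a 3% miss rate
print(amat([(1, 0.01)], 100))  # 2 cycles with a 1% miss rate

# Two levels with hypothetical miss rates: L1 = 4 cycles / 10%, L2 = 11 cycles / 30%.
print(amat([(4, 0.10), (11, 0.30)], 100))  # about 8.1 cycles
```

In this hypothetical setup, dropping the L2 would give 4 + 0.10 × 100 = 14 cycles, so the extra level cuts the average noticeably.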
